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Abstract 

High Energy Physics data sets are often characterized by a huge number of events. Therefore, 
it is extremely important to use statistical packages able to efficiently analyze these unprecedented 
amounts of data. We compare the performance of the statistical packages StatPatternRecognition 
(SPR) and Toolkit for Multivariate Analysis (TMVA). We focus on how CPU time and memory 
usage of the learning process scale versus data set size. As classifiers, we consider Random Forests, 
Boosted Decision Trees and Neural Networks. For our tests, we employ a data set widely used 
in the machine learning community, "Threenorm" data set, as well as data tailored for testing 
various edge cases. For each data set, we constantly increase its size and check CPU time and 
memory needed to build the classifiers implemented in SPR and TMVA. We show that SPR is often 
significantly faster and consumes significantly less memory. For example, the SPR implementation 
of Random Forest is by an order of magnitude faster and consumes an order of magnitude less 
memory than TMVA on Threenorm data. 



1. Introduction 

In modern High Energy Physics (HEP) analyses, the use of machine learning techniques has 
become increasingly common. Machine learning classifiers are used to separate signal events from 
unwanted background [1]. Several HEP experiments are characterized by an extremely large num- 
ber of events. Therefore, it is crucial to use statistical packages able to efficiently analyze data sets 
described by million events or even larger. Statistical packages widely used among statisticians, 
such as R or Weka, often implement many statistical techniques that would be extremely useful in 
HEP analyses. However, they are hardly efficient in dealing with huge data sets [2], 

The two statistical software most used in the HEP community are StatPatternRecognition 
(SPR) [3] and Toolkit for Multivariate Analysis (TMVA) [4]. Both packages have been devel- 
oped within the HEP community and are targeted to HEP statistical analyses. In this work, we 
compare how CPU time and memory usage performance scale to data set size in SPR and TMVA. 

A detailed description of the machine learning techniques described in this work can be found 
inRef. [5]. 
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2. SPR and TMVA 



SPR is an open source standalone C++ package that can be run within ROOT [6] or from the 
command line. It implements linear and quadratic discriminant analysis, logistic regression, binary 
decision splits, bump hunter, two flavors of decision trees, a feedforward backpropagation neural 
net with a logistic activation function, several flavors of boosting including the arc-x4 algorithm, 
bagging and random forest. The package also includes two multiclass methods that allow to use 
any binary classifier for a multiclass classification problem. Moreover, it implements algorithms to 
boost or bag any sequence of classifiers as well as combine classifiers trained on subsets of input 
variables. 

TMVA is an open source project integrated with ROOT. It implements the following multivari- 
ate techniques: rectangular cuts, projective and multidimensional likelihood estimators, k-Nearest 
Neighbor, Fisher and H-matrix discriminants, linear and function discriminant analysis, multilayer 
perceptron neural networks, support vector machine, boosted decision trees, random forest, and 
rulefit. It also allows to boost any classifier. 

Both packages implement similar techniques to pre-process the data, such as normalization, 
principal component analysis, correlations, and cuts. Variable importance in TMVA is estimated 
for neural networks by calculating the weight of neural network links and for decision trees by 
calculating the improvement in the classifier performance given by the splits on each variable. In 
addition to these techniques, SPR also presents other variable importance algorithms that work 
with any classifier: random permutation of the class label, "Add N Remove R", and interactions. 

Finally, SPR also includes cross-validation techniques, allows to choose among 10 Figures 
of Merit to optimize the classifier and test its performance, and implements Friedman's machine 
learning-based Goodness of Fit test [7]. 

For a more comprehensive description of SPR and TMVA, please see readme and user's guides 
included in package distributions [3, 4]. A useful description to get started with both packages can 
also be found in Ref. [8]. 

Among the HEP community, TMVA is considered more user friendly and SPR is considered 
more powerful and faster [2, 8]. 

Although a significant number of classifiers are included in both packages, their implementa- 
tion can differ, leading to different results [2]. Different implementation implies that, for instance, 
the best parameters for SPR Boosted Decision Trees (BDT) are not necessarily the best parameters 
for TMVA BDT [8]. 

A detailed description of a statistical analysis to observe the single-top signal, using both 
TMVA and SPR, can be found in Ref. [9]. Firstly, the author separately looks for the classifier with 
the best predictive power from TMVA and from SPR. BDT are the best classifier for TMVA and 
Random Forest (RF) for SPR. Finally, it is shown that SPR RF achieves the best overall predictive 
power. 

In Ref. [9], TMVA RF predictive power is the lowest. One of the reasons of that poor behavior 
could be that TMVA default parameter for RF tree depth equals to 3 (although TMVA user's 
guide reports that tree depth default value is 10 5 ). Also, another issue with TMVA RF is that the 
algorithm automatically sets tree depth value back to 3 when that option is manually set to an 
extremely large value (such as, for instance, 10 10 ). RF typically needs extremely large trees to 
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perform well. Therefore, it is advisable to specifically check TMVA RF tree structure after the 
classifier has been built. 

3. Classifiers and Data Sets 

For our tests, we use release 08.02.00 of SPR and 4.0.4 of TMVA. CPU time reported in this 
work refers to elapsed real time. To estimate memory usage, we consider Resident Set Size (RSS) 
[10]. Memory usage traces are collected by running the top utility at intervals proportional to the 
expected length of the process. Maximum value for RSS is reported here. Both packages quickly 
reach a value close to the maximum and then their memory usage becomes almost constant until 
the end of the process. The executables are run on a dedicated machine with no other major task 
running simultaneously. The machine runs CentOS Linux 5.4 and has 8 Intel(R) Xeon(R) @ 2.33 
GHz processors with 8 GB of RAM. 

As classifiers, the attention is restricted to RF, BDT, and Neural Networks (NN). 

With regards to the most important parameters, for RF we choose 50 trees, at least 5 events 
per leaf, and each split chosen among D/2 variables randomly selected at each node, where D is 
the data set dimensionality. For BDT, we choose 100 cycles and 10% of the events as minimum 
number of events per leaf . ForNN, layer structure is D : D : D/2 : 1. As suggested by TMVA user's 
guide [4], TMVA neural network model is the MultiLayer Perceptron (MLP). All other parameters 
are set so that the packages would perform the same task for each classifier. 

Goal of the analysis is to check how CPU time and memory usage scale to data set size. The 
chosen parameter values are not necessarily the best ones for each data set in terms of classifier 
predictive power. Finding the best classifier configuration for each data set would make hard to 
exactly keep trace of CPU time and memory usage with respect to the data set size. However, the 
chosen parameters are likely to be reasonably close to the best possible configuration for every 
classifier [11, 12]. 

TMVA's variable transformation options are disabled in order to achieve a major similarity 
between the tasks executed by the packages. TMVA's variable transformation options, which are 
on by default in the code example, include decorrelation, principal component decomposition, 
and gaussianisation. They consume a very large amount of memory. For instance, those variable 
transformation options take up to 5 or 6 times the maximum amount of memory needed to build 
BDT. The bigger is the data set, the larger is the ratio between the memory needed for variable 
transformation and for training the BDT classifiers. SPR does not need such an adjustment since 
uses different executables to perform different tasks. 

As a benchmark data set, we choose the "Threenorm" data set introduced by Breiman in Ref. 
[12]. Given a= 2/ (20) 1//2 , one class is drawn from a unit multivariate normal with mean (a, -a, a, -a, 
....,a). The other class is drawn with equal probability from a unit multivariate normal with mean 
(a,a,...,a) and from a unit multivariate normal with mean (-a,-a,...,-a). This data set is considered a 
difficult one for classification problems [12] and has been widely used for comparison of machine 
learning algorithms [13]. 

Moreover, we train our classifiers on edge cases. That is, data sets described by a single variable 
which follows an unusual distribution. Example of edge cases are a random noise variable, a 
variable for which all events except for one have the same value, and a variable for which 50% of 
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Figure 1: Threenorm data set: BDT learning CPU time vs number of events for 20 variables (left) and 100 variables 
(right). 
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Figure 2: Threenorm data set: BDT memory usage vs number of events for 20 variables (left) and 100 variables (right). 

events are useless and 50% have discriminant power. With respect to edge cases, we only test the 
behavior of RF and BDT. 

Results refer to the training part only. We start with small data sets and gradually increase 
dimensionality and number of events. We start with 2 x 10 4 events and 20 variables. For BDT, 
we gradually increase dimensionality up to 100 variables and size up to 4 x 10 6 events. For RF, 
being slower given the optimization parameters, we increase up to 100 variables and 10 6 events; 
and for NN, the slowest among the three, up to 70 variables and 10 6 events. For edge cases, we 
only increase the number of events as described above. 



4. Test on "Threenorm" Data Set 

We test learning time and memory usage on Threenorm data set for BDT, RF, and NN. Fig. 1 
shows a comparison of CPU time needed by SPR and TMVA to build BDT with 20 variables (left) 
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Figure 3: Threenorm data set: RF learning CPU time vs number of events for 20 variables (left) and 100 variables 
(right). 
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Figure 4: Threenorm data set: RF memory usage vs number of events for 20 variables (left) and 100 variables (right). 

and 100 variables (right). CPU time of tree-based classifiers is expected to grow linearly versus 
NlogN, where N is the number of events. SPR BDT are constantly faster than TMVA BDT. We 
also check configurations with 40 and 70 variables. Results are as expected. That is, they are in 
between the 20 and 100 variables outputs. 

Fig. 2 shows SPR and TMVA memory usage for 20 variables and 100 variables. TMVA con- 
sumes more memory in both configurations. 

The trees built by SPR and TMVA have a very similar structure, with their depth being equal to 
either 2 or 3. This depends on the optimization parameters chosen, i.e. 10% of events as minimum 
number of events per leaf. Therefore, this analysis gives a useful indication of time and memory 
difference to build classifiers based on small trees. 
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Figure 5: Threenorm data set: NN learning CPU time vs number of events for 20 variables (left) and 70 variables 
(right). 



NN - 20 Vars 



NN - 70 Vars 




010 210 410 610 810 1 10 

Number of Events 




010 210 410 610 810 1 10 

Number of Events 



Figure 6: Threenorm data set: NN memory usage vs number of events for 20 variables (left) and 70 variables (right). 



We repeat the same test using RF classifiers. CPU time results are shown in Fig. 3. SPR is 
again faster than TMVA. The difference between SPR and TMVA is larger than for the BDT test. 
RF trees are larger than BDT trees. Therefore, bigger trees lead to a larger difference in terms of 
performance between SPR and TMVA. Furthermore, the ratio between TMVA and SPR CPU time 
increases as the data set size gets larger. That is, SPR CPU time increases linearly with respect to 
NlogN and TMVA CPU time increases more than linearly. 

Fig. 4 shows memory usage results for the RF test. TMVA memory usage is always larger than 
SPR's one. 

SPR and TMVA NN CPU time is shown in Fig. 5 for 20 variables (left) and 70 variables (right). 
Again, SPR is always faster than TMVA for any data set size. However, the difference between 
SPR and TMVA is not as large as in the BDT and RF cases. 

Fig. 6 shows the corresponding memory usage for the NN test. SPR consumes less memory 
for all the configurations we tested. 
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Figure 7: "Uniform" data set: RF (left) and BDT (right) learning CPU time vs number of events. 
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Figure 8: "Uniform" data set: RF (left) and BDT (right) memory usage vs number of events. 

5. Test on "Edge Case" Data Sets 

We test SPR and TMVA on a data set with events characterized by a single variable following 
the same uniform distribution for signal and background. We define this data set as "Uniform". 
Fig. 7 shows CPU time for RF (left) and BDT (right). The difference between SPR and TMVA for 
RF is extremely large. SPR is faster for BDT too. Again, RF builds large trees, while BDT trees 
are forced to be small by the optimization parameters. 

Fig. 8 shows TMVA and SPR memory usage for BDT and RF. Memory consumption mirrors 
CPU time. TMVA always consumes more memory than SPR. 

Another edge case data set is characterized by a single variable that has the same value for all 
the events, except for one. We define it as "Same value" data set. As expected, learning time in 
this case is fast for both packages. CPU time outputs are shown in Fig. 9. SPR is always faster 
than TMVA. In this test, SPR and TMVA grow similar trees for both RF and BDT. 

Fig. 10 shows TMVA and SPR memory usage for BDT and RF. TMVA consumes constantly 
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Figure 9: "Same value" data set: RF (left) and BDT (right) learning CPU time vs number of events. 
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Figure 10: "Same Value" data set: RF (left) and BDT (right) memory usage vs number of events. 

more memory than SPR for any classifier and any data set size. 

SPR and TMVA are also tested on the "Same value and separation" data set, which is described 
by one variable that has the same value for 50% of the events and, for the remaining 50%, follows 
the "Threenorm" distribution. That is, only half of the events are relevant to separate signal from 
background, with the remaining half being useless. Results shown in Fig. 11 and Fig. 12 are in 
substantial agreement with the results in Sec. 4 and Sec. 5, with SPR always having a better 
performance than TMVA. 

6. Summary 

We have evaluated CPU time and memory usage for different classifiers implemented in SPR 
and TMVA. As benchmark data, we used the Threenorm data set and targeted edge cases. In every 
single test we performed, SPR has been faster than TMVA and consumed less memory. For RF 
and BDT, the difference between SPR and TMVA is particularly large. NN performance of the two 
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Figure 11: "Same value and separation" data set: RF (left) and BDT (right) learning CPU time vs number of events. 
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Figure 12: "Same value and separation" data set: RF (left) and BDT (right) memory usage vs number of events. 



statistical packages is closer. For tree-based classifiers, the ratio between TMVA and SPR CPU 
time increased as the number of events increased. In all other tests, the ratio between SPR and 
TMVA tended to be constant when increasing the data set size. 
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